Abstract. As the amount of user-generated content rises dramatically, so
does the need to structure these data, to extract relevant semantic
relationships buried in the data, and to visualize the relationships found
appropriately.

We suggest an innovative method to structure, visualize, and visually ex- 
plore user-generated data using a cartogram of a self-organizing map. This 
distorted self-organizing map overcomes the cognitive limitations of the 
traditional self-organizing map by combining this neural network mapping 
approach with cartographic methods used to generate cartograms. 

First, our novel mapping approach is put to a rigorous test in a case study 
aimed to uncover the latent semantic structure from text documents in the 
Wikipedia Encyclopedia. Second, the latent structure uncovered with the 
self-organizing map cartogram is systematically evaluated by comparing it 
to an established network visualization method and output. 

The resulting self-organizing map cartogram reveals relevant structures in 
the considered Wikipedia data. The comparative evaluation confirms the 
validity and the stability of the found patterns, and therefore of our novel 
visualization solution. 

This paper further contributes to the spatialization research line, by ex- 
panding the use of well-established and empirically evaluated cartographic 
depiction methods to the visualization of non-geographic data, such as, for 
example, user-generated data increasingly available in today's networked 
information society. 
gram, network visualization



1. Introduction 



The availability and the amount of user-generated content on the WWW
are rising dramatically. Wikipedia and Twitter are two well-known examples
of massive text- and graphics-based, crowd-sourced, and freely available
online databases. To systematically analyze and explore such massive
semi-structured semantic databases with visual analytics methods,
cartographers can contribute perceptually salient and cognitively adequate
mapping solutions, as presented in this paper.

Following the line of research at the interface of cartography and infor- 
mation visualization, Skupin & Fabrikant (2005, 2007) introduced the spa- 
tialization framework which suggests a systematic approach to transform 
high-dimensional data sets into lower-dimensional, spatial representations 
for facilitating data exploration and knowledge production using spatial 
metaphors. Their approach pays tribute to different traditional and empiri- 
cally evaluated cartographic design principles, such as the theory of the vis- 
ual variables (Bertin 1974, MacEachren 1995), and the cartographic gener- 
alization process (McMaster 1989, Buttenfield & McMaster 1991), for ex- 
ample, and integrates established dimension reduction techniques from 
information visualization, such as self-organizing maps (SOM) and net- 
work visualizations. 

A self-organizing map, in essence a neural network, projects input data onto
a two-dimensional, topological space, typically represented by a regular
tessellation (e.g., of hexagons) whose cells are the neurons (Kohonen 2001).
The neurons in the SOM have the same attributes as the input data, and
neurons that share similar attribute values, and are therefore semantically
similar, are placed near each other (Skupin & Agarwal 2008). The original
input data are mapped as points onto the neurons with the semantically most
similar attributes. However, traditional SOMs have an important conceptual
limitation: the proximity metric, defined by data similarity, is not uniform
across the SOM space, and thus violates the distance-similarity metaphor
defined by Montello et al.
(2003). We present an innovative approach to overcome this limitation
which combines SOM with the long-standing cartographic tradition of car- 
tograms. This new approach is an extension of the cartographically inspired 
spatialization approach presented by Skupin & de Jongh (2005), and Fabri- 
kant & Salvini (2011) at prior ICC meetings. This extended framework is 
first put to a rigorous test in a case study spatializing more than 2,000 Wik- 
ipedia articles. Second, our approach is systematically evaluated in compar- 
ison with a well-established network visualization approach using the same 
data. 



2. Methods 



The methods used are inspired by previous work by Skupin & de Jongh
(2005) and Fabrikant & Salvini (2011), who semantically explored and
analyzed the ICC conference proceedings using SOM and network visualization
techniques. In this study, we analyze a set of Wikipedia articles and extend
the spatialization framework to the neural network space of the SOM by
applying cartogram techniques. In doing so, we intend to enhance the
cognitive plausibility and the visual saliency of the resulting
visualization.

2.1. Data 

As this proposed approach is a part of a larger project intended to uncover 
the functional structure of the regional organization in the Eastern parts of 
Switzerland, we considered only the German version of Wikipedia. The se- 
mantic corpus consists of the titles of 2,158 Wikipedia articles including the 
standard description of 8,812 related categories. 

2.2. Towards the distorted self-organizing map 

As a first step, we had to analyze the semantic content captured in the arti- 
cle titles, and the standard category descriptions in Wikipedia. To do so, we 
employed the probabilistic topic model (TM) method as described in Stey- 
vers & Griffiths (2007), available in the Text Visualization Toolbox (TVT) in 
MATLAB (Hespanha & Hespanha 2011). 

Following Fabrikant & Salvini (2011), we chose 20 topics for the TM step. As
a result we obtain a two-mode article-topic matrix, in which each article is
described by a vector of probability values, one for each of the 20 topics.
The higher the probability, the stronger the semantic association between
an article and the respective topic.
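
To make the structure of the two-mode matrix concrete, the following minimal sketch shows a toy article-topic matrix. The article names, the number of topics (4 instead of 20), and all probability values are hypothetical and chosen for illustration only. We further assume that each row is a probability distribution over topics, i.e., that it sums to 1.

```python
# Hypothetical toy article-topic matrix: 3 articles x 4 topics.
# (The study uses 2,158 articles and 20 topics; all values here are invented.)
article_topic = {
    "article_a": [0.70, 0.10, 0.10, 0.10],
    "article_b": [0.05, 0.80, 0.10, 0.05],
    "article_c": [0.10, 0.15, 0.05, 0.70],
}


def dominant_topic(probabilities):
    """Return (topic index, probability) of the most probable topic."""
    best = max(range(len(probabilities)), key=lambda k: probabilities[k])
    return best, probabilities[best]


for title, row in article_topic.items():
    # Assumption: each row is a probability distribution over the topics.
    assert abs(sum(row) - 1.0) < 1e-9
    print(title, dominant_topic(row))
```

Each row of this matrix is one input vector for the SOM training step.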

This article-topic matrix served as the input for the SOM calculation per- 
formed with the SOM Analyst toolbox in ArcGIS, developed by Lacayo- 
Emery (2011). We chose the SOM parameters proposed by Lacayo-Emery 
(2011) and Skupin & Esperbe (2011). Following SOM guidelines as suggest- 
ed by Wendel et al. (2009), we decided to create a SOM of 30x30 neurons 
in size. 

The initial SOM, based on the two-mode article-topic matrix, was first
trained with 4,500 runs, using a neighborhood radius of 30, to establish
broad, global structures. In a second stage, we trained the SOM with 40,000
runs and applied a neighborhood radius of 6 to carve out regional and local
structures. The trained SOM consists of 20 different component planes, which
represent the distribution of the data values for each of the 20 TM input
vectors. We then project all considered Wikipedia articles onto the
component planes via their best matching units (BMU). During this step, each
input data point is assigned to the neuron in the component planes that best
fits its semantic attributes.
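
The two-stage training and the BMU assignment can be sketched as follows. This is a minimal illustration, not the SOM Analyst implementation: it assumes a small rectangular grid (instead of 30x30 hexagonal neurons), a Gaussian neighborhood function, linearly decaying radius and learning rate, and invented toy input vectors. The function names and all parameter values are hypothetical.

```python
import math
import random


def bmu(weights, vector):
    """Grid index (row, col) of the neuron whose weights are closest to `vector`."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - vi) ** 2 for wi, vi in zip(w, vector))
            if d < best_dist:
                best, best_dist = (r, c), d
    return best


def train_phase(weights, data, runs, radius, rate, rng):
    """One training stage; radius and learning rate decay linearly (assumption)."""
    for t in range(runs):
        frac = 1.0 - t / runs
        sigma = max(radius * frac, 0.5)   # current neighborhood radius
        alpha = rate * frac               # current learning rate
        x = rng.choice(data)
        br, bc = bmu(weights, x)
        for r, row in enumerate(weights):
            for c, w in enumerate(row):
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
                for k in range(len(w)):
                    w[k] += alpha * h * (x[k] - w[k])


rng = random.Random(0)
data = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]  # toy topic vectors
size = 4
weights = [[[rng.random() for _ in range(2)] for _ in range(size)]
           for _ in range(size)]

# Stage 1: few runs, large radius (global structure).
train_phase(weights, data, runs=200, radius=4, rate=0.5, rng=rng)
# Stage 2: many runs, small radius (regional and local structure).
train_phase(weights, data, runs=800, radius=1, rate=0.1, rng=rng)

# Assign every input vector to its best matching unit.
assignments = [bmu(weights, x) for x in data]
print(assignments)
```

The first stage mirrors the short training with a large radius, the second the long training with a small radius.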

As a final step we calculated the U-matrix, which contains a semantic
similarity value for every neuron compared to all neighboring neurons in the
space. An excerpt of the resulting component planes, the BMU, and the
neighborhood similarity is shown in Figure 1.
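
A U-matrix value can be sketched as the mean distance between a neuron's weight vector and those of its grid neighbors. The example below assumes a rectangular grid with four neighbors per neuron and Euclidean distance; both are simplifications for illustration (the SOM in this study is hexagonal). With this definition, higher values indicate lower similarity to the neighborhood.

```python
import math


def u_matrix(weights):
    """Mean Euclidean distance from each neuron to its 4-connected neighbors."""
    rows, cols = len(weights), len(weights[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            result[r][c] = sum(dists) / len(dists)
    return result


# Toy 2x3 grid: the right-hand column differs from the rest of the map.
weights = [
    [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]],
]
u = u_matrix(weights)
print(u)
```

In the toy grid, the neurons along the boundary between the two homogeneous areas receive the highest values.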



Figure 1. Excerpt of trained SOM including BMU (left, panel title: Trained
self-organizing map) and SOM cartogram (right, panel title: Self-organizing
map cartogram)



In order to distort the SOM, based on the distance-similarity metaphor, we
first transferred the component planes in shapefile format to the ScapeToad
cartogram software (Andrieu et al. 2008). In ScapeToad we selected the
U-matrix values as the variable to distort the SOM space. Following the
distance-similarity principle, neurons that are semantically less similar
are pushed further apart than neurons that are more similar to each other.
The resulting cartogram was transferred back to ArcGIS for further
processing. As a next step we calculated the center points of the distorted
neurons. The resulting SOM cartogram, including the center points, is shown
in Figure 1 (SOM cartogram).
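
The center point of a distorted neuron can be computed as the area-weighted centroid of its polygon. The sketch below uses the standard shoelace formula and assumes that each neuron is available as a simple (non-self-intersecting) polygon given by its vertex coordinates; the function name and the example coordinates are hypothetical, and this is not the ArcGIS routine used in the study.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    area2 = 0.0  # twice the signed polygon area
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    return cx / (3.0 * area2), cy / (3.0 * area2)


# Toy example: an undistorted square cell and a stretched (distorted) cell.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
stretched = [(0, 0), (4, 0), (4, 2), (0, 2)]
print(polygon_centroid(square), polygon_centroid(stretched))
```

The distances between such centroids then carry the similarity information of the cartogram.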

So far we have used size as a visual variable to distort the SOM according
to the distance-similarity metaphor. In addition to the distorted component
planes, we obtain a point grid consisting of the neuron center points. The
distances between the points in the grid represent the semantic similarity
between neighboring neurons, according to the distance-similarity metaphor
(Fabrikant et al. 2006), using location as an additional visual variable.

2.3. Thematic clusters 

As a next step, we applied a clustering algorithm in order to further
generalize our input data, and to explore how clusters of semantically
similar articles might be distributed within the SOM cartogram. We therefore
transformed the two-mode article-topic matrix (see Section 2.2) into a
one-mode article-to-article matrix, which indicates the semantic similarity
between the input articles. For this step we employed the Blondel community
detection algorithm (Blondel et al. 2008), an established clustering
algorithm for social networks, as suggested by Fabrikant & Salvini (2011).
Applying this algorithm, thirteen article clusters emerge. To automatically
describe the semantic content of the thirteen clusters, we employed the
tf-idf method, which extracts the most relevant terms of every cluster
(Manning et al. 2009); these terms are listed in Figure 2. Cluster
membership is illustrated in the SOM using color hue; the number of articles
per neuron is visualized by scaling the neuron's center point using the
visual variable size.
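
The transformation from the two-mode to the one-mode matrix can be illustrated with a short sketch. The similarity measure is an assumption made for this example: we use the cosine similarity between the topic vectors of two articles, which is one common choice and not necessarily the measure applied in the study. The toy vectors are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def one_mode(article_topic):
    """Article-to-article similarity matrix from an article-topic matrix."""
    n = len(article_topic)
    return [[cosine(article_topic[i], article_topic[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical toy data: articles 0 and 1 share a dominant topic.
article_topic = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]
similarity = one_mode(article_topic)
print(similarity)
```

The resulting symmetric matrix can be read as a weighted network, which is the input that a community detection algorithm requires.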

3. Results 

Figure 2 depicts the SOM cartogram, including the thirteen Blondel clusters
and the number of articles per neuron. The size of the center point of a
neuron represents the number of articles that fit best to the corresponding
neuron according to the BMU. The larger the center point, the higher the
number of articles per neuron. Color hue depicts the cluster membership of
the majority of the articles within a neuron. White center points depict
neurons where none of the thirteen clusters represents more than 50% of the
articles at that center point. The three most relevant terms are listed in
the legend of Figure 2, below a general content description for each
cluster.
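
The symbolization rule for the center points can be sketched as follows: the article count drives the symbol size, and a cluster color is assigned only if one cluster accounts for more than 50% of the articles in the neuron. The function name is hypothetical; `None` stands for a white center point.

```python
from collections import Counter


def neuron_symbol(cluster_labels):
    """Return (article count, majority cluster or None) for one neuron."""
    count = len(cluster_labels)
    if count == 0:
        return 0, None
    label, freq = Counter(cluster_labels).most_common(1)[0]
    # A cluster color is used only if it covers more than 50% of the articles.
    return count, (label if freq / count > 0.5 else None)


print(neuron_symbol(["Art", "Art", "Transportation"]))
print(neuron_symbol(["Art", "Transportation"]))
```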
Figure 2. SOM cartogram including cluster membership and number of
represented articles



Figure 2 reveals that most clusters form homogeneous regions in the SOM.
However, the two People & jobs clusters, both located in the center of the
SOM, and the clusters Art and Places of interest are not as homogeneous as
the other clusters. Articles belonging to the cluster Places of interest are
even separated into two isolated regions in the SOM. One of these regions is
located in the upper-right, and the other in the lower-left corner of the
map. The clusters in the middle of the SOM are also semantically more
ambiguous than the clusters at the corners or at the edges of the SOM. Due
to the cartogram distortion of the SOM, the semantic similarity between
neighboring neurons can be identified. For example, the distortion pattern
within the yellow cluster called Transportation on the left of the SOM is
interesting. While the neurons in the middle of the yellow cluster are quite
small, and the articles there are thus semantically similar, the larger
neurons at the edge of this cluster highlight the lower semantic similarity
of a group of articles, even though these articles belong to the same
cluster. Additionally, the semantic dissimilarity between different clusters
can be observed, for example in the large neurons on the border between the
yellow Transportation and the green Administrative units clusters in the
lower-left corner of the map.

4. Evaluation 

We evaluate our approach in three steps. First, we compare the structures
and the distribution of the clusters in the SOM cartogram with a network
visualization using the same data as input. Second, we quantitatively assess
the consistency of the found clusters in the SOM cartogram and in the
network visualization. Finally, we compare the distribution and the semantic
content of one specific cluster in the two visualizations in more detail.

4.1. Comparison of the uncovered structures 

In order to produce a network visualization of our data we input the one- 
mode article-to-article matrix (see Section 2.3) to Network Workbench 
(NWB Team 2006), following the methods presented in Fabrikant & Salvini 
(2011) to generalize and visualize a spatialization network display. To clus- 
ter the articles we again employ the Blondel community detection algo- 
rithm, as presented in Section 2.3. The resulting network is depicted in Fig- 
ure 3. 

In order to improve the legibility and perceptual salience of this
visualization, we applied empirically validated design principles to the
network configuration (Fabrikant et al. 2004). In particular, we identified
semantically central nodes, and aggregated the less central nodes within a
node cluster to their closest central node using ArcGIS. Then, we visualized
the aggregated points as graduated pie charts, as illustrated in Figure 3.
The pie charts are scaled based on the number of articles that were
aggregated. We depict the cluster membership of the aggregated nodes with
pie chart segments, where each colored segment represents the proportion of
articles that belong to a specific cluster. The edges in Figure 3 represent
the structurally most salient linkages according to the semantic similarity
relationships.

The uncovered latent structure in the network visualization, shown in
Figure 3, fits with the latent structure depicted in the SOM cartogram
(Figure 2). A first qualitative comparison of the distribution of the
thirteen clusters shows a noteworthy pattern: Figure 3 illustrates that for
every cluster there is at least one pie chart in which this cluster reaches
a proportion of at least 75%. The orange cluster labeled People & jobs is
the only exception. Although this cluster appears in many of the charts, it
never reaches such a large proportion. It is also interesting to see that
the red People & jobs cluster and the pink cluster named Art are jointly
distributed across the network and in some charts even reach more than 20%.
This pattern is also visible in the SOM cartogram in Figure 2, where the
clusters People & jobs and Art are located in the middle of the SOM
cartogram, thus indicating that they are semantically vague and close to
many other clusters in the space.

Looking at the cluster Places of interest in Figure 3, one can notice that
it splits into three different pie charts in which the cluster reaches a
proportion of over 75%. A comparable pattern is visible in the SOM cartogram
in Figure 2, as Places of interest is the only cluster with a high number of
clustered articles in two different regions on the map. This could be
evidence for the semantic diversity of this cluster.

In comparison to the network visualization, the SOM cartogram provides a
more nuanced picture of the document similarity. The SOM cartogram makes it
possible to recognize not only the semantic similarity between different
clusters, but also the structure within clusters. In the network
visualization, by contrast, the similarity between the articles within a
graduated pie chart is not easily accessible because of the higher degree of
generalization.
To quantitatively assess the consistency of the uncovered latent structures
in the SOM cartogram, we compare it statistically with the network
visualization. In particular, we are interested in identifying whether
articles grouped in the same graduated pie chart in Figure 3 would also be
grouped into the same region in the SOM cartogram. We employed the k-means
algorithm to identify 20 regions in the SOM, as there are also 20 graduated
pie charts in the network visualization. The small charts that appear
between the large graduated pie charts in Figure 3 were ignored for this
comparison, as each of them consists of only one article. As a next step, we
calculated a square matrix including the co-occurrence of articles in the
k-means regions of the SOM and in the graduated pie charts of the network
visualization. This co-occurrence matrix provides the basis to
quantitatively assess how well the uncovered latent structures match in the
two considered visualizations. The consistency between the two latent
structures is assessed with Cohen's Kappa coefficient (Cohen 1960) and the
hypergeometric test (Kos & Psenicka 2000).
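
Cohen's Kappa can be computed directly from a square co-occurrence matrix. The sketch below assumes that rows (k-means regions) and columns (graduated pie charts) are ordered so that matched region-chart pairs lie on the diagonal; how this matching is obtained is not covered here. The toy matrix is hypothetical.

```python
def cohens_kappa(matrix):
    """Cohen's Kappa for a square co-occurrence (contingency) matrix."""
    n = sum(sum(row) for row in matrix)
    # Observed agreement: share of articles on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Toy example with two regions and two charts (4 articles in total).
toy = [
    [2, 0],
    [1, 1],
]
print(cohens_kappa(toy))
```

A value of 1 indicates perfect agreement, while a value of 0 indicates agreement no better than chance.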

Applying Cohen's Kappa coefficient, we get a value of 0.91, which means that
the graduated pie charts in the network visualization and the k-means
regions of the SOM cartogram are very consistent. Applying the
hypergeometric test to our data yields a probability value of 0.00, which
indicates that the network visualization reproduces the clusters in the SOM
cartogram very well, and that this correspondence is statistically
significant at a significance level of 5%.
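
The logic of a hypergeometric test can be sketched for a single region-chart pair. We assume here, for illustration only, that the test asks how likely it is that a region of a given size shares at least the observed number of articles with a chart if the region's articles were drawn at random from all articles. The function name, the parameter names, and the toy numbers are hypothetical and do not reproduce the computation in the study.

```python
from math import comb


def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws).

    population: total number of articles
    successes:  number of articles in the pie chart
    draws:      number of articles in the k-means region
    observed:   number of articles shared by region and chart
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    hits = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return hits / total


# Toy example: 10 articles, chart and region of 5 articles each, all 5 shared.
p_value = hypergeom_tail(population=10, successes=5, draws=5, observed=5)
print(p_value)
```

A small tail probability means that the observed overlap is unlikely to arise by chance.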

4.2. Distribution of semantic content within clusters 

In the previous sections we observed that the articles of the Places of
interest cluster split into different graduated pie charts in the network
visualization, and are distributed across different regions in the SOM
cartogram. We therefore expect this cluster to contain semantically
different content, and analyzed its semantic content in more detail.

First, we selected the k-means-regions as described in Section 4.1 in the 
SOM cartogram, containing at least five articles from the cluster Places of 
interest. The result is depicted in Figure 4. The dark grey boundaries indi- 
cate the borders of the selected k-means-regions. These regions are labeled 
with the most relevant words appearing in the respective cluster for each of 
the analyzed k-means-regions, extracted by the tf-idf method. 
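
Extracting the most relevant terms per group with tf-idf can be sketched as follows. The example treats each group (a k-means region or a pie chart) as one document, uses relative term frequency and a logarithmic inverse document frequency, and breaks ties alphabetically; this weighting variant is an assumption. The token lists are invented, although some terms echo those reported for this cluster.

```python
import math
from collections import Counter


def top_terms(groups, k=3):
    """groups: {name: list of tokens}. Returns {name: top-k terms by tf-idf}."""
    n = len(groups)
    doc_freq = Counter()
    for tokens in groups.values():
        doc_freq.update(set(tokens))  # count each term once per group
    result = {}
    for name, tokens in groups.items():
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n / doc_freq[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        result[name] = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return result


# Hypothetical token lists; "ort" occurs in every group and thus scores 0.
groups = {
    "group_1": ["kloster", "kloster", "kirche", "ort"],
    "group_2": ["radweg", "radweg", "rheinbrücke", "ort"],
    "group_3": ["tour", "suisse", "radrennen", "ort"],
}
labels = top_terms(groups, k=2)
print(labels)
```

Terms that occur in every group receive a weight of zero and therefore never label a group.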
Figure 5 illustrates the same network as in Figure 3 but only with the pie
chart segments of the Places of interest cluster highlighted. Again we label
the charts that contain at least five articles belonging to the cluster
Places of interest with the most relevant terms.

A comparison of Figure 4 and Figure 5 shows that the cluster Places of
interest splits into three groups (i.e., k-means-regions and graduated pie
charts) with different semantic content. Still, the groups in the two
visualizations have a high semantic correspondence. In both solutions, one
group includes articles about religious buildings, and the most relevant
terms are Kloster (abbey) and Kirche(-ngebäude) (church / buildings). The
second group is about bicycle paths and religion, with the most important
words Radweg (bicycle path), Rheinbrücke (bridge over the Rhine), and
Religion (religion). The third group is about the Tour de Suisse, a
well-known bicycle race in Switzerland, with the most important words Tour
(race), Suisse (Switzerland), and Radrennen (bicycle race).




Figure 5. Places of interest cluster in the network visualization 



This systematic comparison suggests not only that the uncovered global
latent structure corresponds well in the two visualizations, but also that
the semantic content of the local structures found is similar in the two
visualizations.



5. Conclusion 



In this paper, we first proposed an innovative approach to combine self- 
organizing maps with the long-standing cartographic tradition of carto- 
grams when analyzing user-generated content. Second, we analyzed and 
systematically evaluated the uncovered latent structure in the self- 
organizing map cartogram by comparing it to a well-established network 
visualization method and output. 

The main advantage of distorted SOMs compared to traditional SOMs is that
the internal structures of clusters of similar documents and the direct
semantic relationships between the clusters found can be identified more
easily, as the distortion pays tribute to the distance-similarity metaphor
(Fabrikant et al. 2006). The visual variable location is ranked by Bertin
(1967) as the most salient of the visual variables. It expresses the
similarity relationships amongst the Wikipedia articles cognitively more
plausibly and perceptually more saliently than other types of
spatializations, for example the network visualization discussed in this
paper.

By comparing the uncovered latent structure in the novel SOM cartogram 
with an already well-established network visualization approach, we were 
able to statistically assess the correspondence of the depicted patterns. This 
systematic evaluation provides validity to our visualization solutions, and 
more generally for using self-organizing maps and network visualizations as 
spatialization methods for the scientific investigation of unstructured text 
data. 